Tutorial on how to scrape job offers from LinkedIn using Puppeteer and RxJS

Web scraping may seem like a simple task, but there are many challenges to overcome. In this blog, we will dive into how to scrape LinkedIn to extract job listings. To do this, we will use Puppeteer and RxJS. The goal is to achieve web scraping in a declarative, modular, and scalable manner.
What is Web Scraping?
Web scraping is an automated method of extracting valuable data from websites. It allows users to retrieve specific information—such as text, images, links, and structured content—without manually copying and pasting. This technique is widely used for various purposes, including market research, data analysis, job listings aggregation, and competitive intelligence.
By leveraging web scraping tools, developers can efficiently collect, process, and utilize web data, transforming unstructured online information into structured insights.
Puppeteer: A Powerful Web Scraping Tool
Puppeteer is a JavaScript library that provides programmatic control over Chrome or Chromium, in headless or full (headful) mode. It allows developers to automate tasks such as navigating web pages, interacting with elements, and extracting data, making it an excellent choice for web scraping.
One of Puppeteer's biggest advantages is its ability to handle dynamic content. Unlike traditional scraping techniques that rely solely on fetching raw HTML, Puppeteer can execute JavaScript, ensuring that all elements—including those loaded asynchronously—are properly rendered before extraction. This makes it particularly useful for scraping websites with complex structures or content hidden behind interactive elements.
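As a quick illustration, here is a minimal sketch (using a placeholder URL, not code from this project) of waiting for JavaScript-rendered content before extracting it:

```ts
import puppeteer from 'puppeteer';

// Minimal sketch: let the page's JavaScript finish rendering before extracting data
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' }); // wait for the network to go quiet
  await page.waitForSelector('h1'); // ensure the element actually rendered
  const heading = await page.$eval('h1', el => el.textContent);
  console.log(heading);
  await browser.close();
})();
```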
Understanding RxJS
RxJS is a powerful JavaScript library designed for reactive programming, making it easier to handle asynchronous data streams efficiently. In this project, we leverage RxJS due to its numerous advantages:
✅ Streamlined Asynchronous Workflow – Enables a declarative approach to managing async operations.
✅ Robust Error Handling – Provides built-in mechanisms to catch and handle errors gracefully.
✅ Effortless Retry Logic – Allows automatic retries when scraping issues arise.
✅ Flexible and Scalable Code – Simplifies adaptation as project complexity grows.
✅ Extensive Operator Support – Offers a rich set of functions to process and manipulate data efficiently.
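As a small taste of that declarative style, here is a sketch (fetchPage is a hypothetical stand-in for any async scraping step, not code from this project):

```ts
import { defer, from } from 'rxjs';
import { retry } from 'rxjs/operators';

// Hypothetical async step standing in for a real scraping call
const fetchPage = (url: string) => Promise.resolve(`<html>${url}</html>`);

// Lazy, retryable, and easy to extend by adding more operators to the pipe
const page$ = defer(() => from(fetchPage('https://example.com'))).pipe(
  retry(3) // automatically retry up to 3 times on error
);

page$.subscribe(html => console.log(html.length));
```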
1. Puppeteer initialization
The code snippet below initializes a Puppeteer browser instance in non-headless mode and subsequently creates a new web page. This represents the most fundamental and straightforward initialization process for Puppeteer:
src/index.ts
```ts
import puppeteer from 'puppeteer';

(async () => {
  console.log('Launching Chrome...');
  const browser = await puppeteer.launch({
    headless: false,
    // devtools: true,
    // slowMo: 250, // slow down puppeteer script so that it's easier to follow visually
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--single-process',
    ],
  });

  const page = await browser.newPage();

  /**
   * 1. Go to linkedin jobs url
   * 2. Get the jobs
   * 3. Repeat step 1 with other search parameters
   */
})();
```
2. Accessing LinkedIn Job Listings and Extracting Data
This is the core section of our blog, where we delve into the process of navigating LinkedIn’s job listings, parsing the HTML content, and extracting job details in a structured JSON format. Our approach ensures that we retrieve relevant job information efficiently while handling potential roadblocks during the scraping process.
2.1. Construct the URL for navigating to LinkedIn job offers page
To access LinkedIn's job listings, we need to construct a URL using the function `urlQueryPage`:
src/linkedin.ts
```ts
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;
```
In this case, I have already conducted the necessary research to identify a suitable URL for scraping. Our goal is to find a URL that can be dynamically parameterized based on our desired search criteria.
For this example, the key search parameters will include:
- `searchText` – The job title or keyword.
- `pageNumber` – The pagination index to navigate through job listings.
- `locationText` (optional) – A specific location filter to refine search results.
By structuring the URL accordingly, we can efficiently retrieve job listings that match our specified criteria.
Example URLs:

- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=0
- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&start=0
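These URLs are exactly what `urlQueryPage` produces; for example (a hypothetical invocation, not part of the original code):

```ts
// pageNumber 2 maps to start=50, since LinkedIn paginates in steps of 25
const url = urlQueryPage({ searchText: 'Angular', pageNumber: 2, locationText: 'Barcelona' });
// → https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=50&location=Barcelona
```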
2.2. Navigate to the URL and extract the job offers
With our target URL identified, we can proceed with the two primary actions required:
- Navigating to the job listings URL: directing our web scraping tool to the URL where the job listings are hosted.
- Extracting the job offers data and converting it to JSON: once on the job listings page, we'll use web scraping techniques to extract the job data and return it in JSON format.
src/linkedin.ts
```ts
export interface ScraperSearchParams {
  searchText: string;
  locationText: string;
  pageNumber: number;
}

/** main function */
export function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
  return defer(() => fromPromise(navigateToJobsPage(page, searchParams)))
    .pipe(switchMap(() => getJobsFromLinkedinPage(page)));
}

/* Utility functions */
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;

function navigateToJobsPage(page: Page, searchParams: ScraperSearchParams): Promise<Response | null> {
  return page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' });
}

export const stacks = ['angularjs', 'kubernetes', 'javascript', 'jenkins', 'html', /* ... */];

export function getJobsFromLinkedinPage(page: Page): Observable<JobInterface[]> {
  return defer(() => fromPromise(page.evaluate((pageEvalData) => {
    const collection: HTMLCollection = document.body.children;
    const results: JobInterface[] = [];
    for (let i = 0; i < collection.length; i++) {
      try {
        const item = collection.item(i)!;
        const title = item.getElementsByClassName('base-search-card__title')[0].textContent!.trim();
        const imgSrc = item.getElementsByTagName('img')[0].getAttribute('data-delayed-url') || '';
        const remoteOk: boolean = !!title.match(/remote|No office location/gi);

        const url = (
          (item.getElementsByClassName('base-card__full-link')[0] as HTMLLinkElement)
          || (item.getElementsByClassName('base-search-card--link')[0] as HTMLLinkElement)
        ).href;

        const companyNameAndLinkContainer = item.getElementsByClassName('base-search-card__subtitle')[0];
        const companyUrl: string | undefined = companyNameAndLinkContainer?.getElementsByTagName('a')[0]?.href;
        const companyName = companyNameAndLinkContainer.textContent!.trim();
        const companyLocation = item.getElementsByClassName('job-search-card__location')[0].textContent!.trim();

        const toDate = (dateString: string) => {
          const [year, month, day] = dateString.split('-');
          return new Date(parseFloat(year), parseFloat(month) - 1, parseFloat(day));
        };

        const dateTime = (
          item.getElementsByClassName('job-search-card__listdate')[0]
          || item.getElementsByClassName('job-search-card__listdate--new')[0] // less than a day. TODO: Improve precision on this case.
        ).getAttribute('datetime');
        const postedDate = toDate(dateTime as string).toISOString();

        /**
         * Calculate minimum and maximum salary
         *
         * Salary HTML example to parse:
         * <span class="job-result-card__salary-info">$65,000.00 - $90,000.00</span>
         */
        let currency: SalaryCurrency = '';
        let salaryMin = -1;
        let salaryMax = -1;

        const salaryCurrencyMap: any = {
          ['€']: 'EUR',
          ['$']: 'USD',
          ['£']: 'GBP',
        };

        const salaryInfoElem = item.getElementsByClassName('job-search-card__salary-info')[0];
        if (salaryInfoElem) {
          const salaryInfo: string = salaryInfoElem.textContent!.trim();
          if (salaryInfo.startsWith('€') || salaryInfo.startsWith('$') || salaryInfo.startsWith('£')) {
            const coinSymbol = salaryInfo.charAt(0);
            currency = salaryCurrencyMap[coinSymbol] || coinSymbol;
          }

          const matches = salaryInfo.match(/([0-9]|,|\.)+/g);
          if (matches && matches[0]) {
            // values are in US format, so we need to remove ALL the commas
            salaryMin = parseFloat(matches[0].replace(/,/g, ''));
          }
          if (matches && matches[1]) {
            // values are in US format, so we need to remove ALL the commas
            salaryMax = parseFloat(matches[1].replace(/,/g, ''));
          }
        }

        // Calculate tags
        let stackRequired: string[] = [];
        title.split(' ').concat(url.split('-')).forEach(word => {
          if (!!word) {
            const wordLowerCase = word.toLowerCase();
            if (pageEvalData.stacks.includes(wordLowerCase)) {
              stackRequired.push(wordLowerCase);
            }
          }
        });
        // Define uniq here: page.evaluate executes inside the browser, so we cannot easily import functions from other contexts
        const uniq = (_array) => _array.filter((item, pos) => _array.indexOf(item) == pos);
        stackRequired = uniq(stackRequired);

        const result: JobInterface = {
          id: item!.children[0].getAttribute('data-entity-urn') as string,
          city: companyLocation,
          url: url,
          companyUrl: companyUrl || '',
          img: imgSrc,
          date: new Date().toISOString(),
          postedDate: postedDate,
          title: title,
          company: companyName,
          location: companyLocation,
          salaryCurrency: currency,
          salaryMax: salaryMax,
          salaryMin: salaryMin,
          countryCode: '',
          countryText: '',
          descriptionHtml: '',
          remoteOk: remoteOk,
          stackRequired: stackRequired
        };
        console.log('result', result);

        results.push(result);
      } catch (e) {
        console.error(`Something went wrong retrieving linkedin page item: ${i} on url: ${window.location}`, e.stack);
      }
    }
    return results;
  }, { stacks })) as Observable<JobInterface[]>);
}
```
The code above extracts the information for every job offer on the page. It may not be the most elegant code, but it gets the job done: parsing this kind of HTML inevitably requires many fallbacks and checks.
In a standard programming context, breaking code into smaller, isolated functions improves readability and maintainability. However, when working with `page.evaluate` in Puppeteer, we face certain limitations. Since this code executes within the Puppeteer (Chrome) instance rather than our Node.js environment, all logic must be self-contained within the `page.evaluate` call. The only exception is simple variables (such as `stacks` in our case), which can be passed as arguments to `page.evaluate`. However, these variables must not contain functions or other non-serializable objects, as Puppeteer does not support passing non-serializable data between Node.js and the browser context.
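To make the constraint concrete, here is a small sketch (assuming an existing `page` inside an async function; not code from the repo):

```ts
// Only JSON-serializable data can cross from Node.js into the browser context
const pageEvalData = { stacks: ['react', 'angular'] }; // a plain object is fine to pass

const found = await page.evaluate((data) => {
  // This function body runs inside Chrome: helpers must be defined here, not imported from Node.js
  const text = document.body.innerText.toLowerCase();
  return data.stacks.filter(stack => text.includes(stack)); // data arrived as a serialized copy
}, pageEvalData);
```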
In this case, the most challenging part of scraping is extracting the salary information, as it requires converting a text format like "$65,000.00 - $90,000.00" into separate `salaryMin` and `salaryMax` values.
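Isolated from the DOM-traversal code, the parsing idea looks roughly like this (a minimal sketch, not the exact code from the repo):

```ts
// Sketch: turn "$65,000.00 - $90,000.00" into numeric bounds
const parseSalaryRange = (salaryInfo: string): { salaryMin: number; salaryMax: number } => {
  const matches = salaryInfo.match(/([0-9]|,|\.)+/g) || []; // → ["65,000.00", "90,000.00"]
  const toNumber = (s?: string) => (s ? parseFloat(s.replace(/,/g, '')) : -1); // US format: strip commas
  return { salaryMin: toNumber(matches[0]), salaryMax: toNumber(matches[1]) };
};

console.log(parseSalaryRange('$65,000.00 - $90,000.00')); // { salaryMin: 65000, salaryMax: 90000 }
```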
To handle potential issues gracefully, we have encapsulated the entire code within a try/catch block. While we currently log errors to the console, it is highly recommended to implement a mechanism for storing error logs on disk. This is especially important because websites frequently update their structure, requiring regular adjustments to the HTML parsing logic.
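A disk-backed error log can be as simple as this sketch (`logScraperError` is a hypothetical helper, not part of the repo):

```ts
import { appendFileSync } from 'fs';

// Persist scraping errors so selector breakages can be reviewed and fixed later
function logScraperError(context: string, error: unknown) {
  const line = JSON.stringify({ date: new Date().toISOString(), context, error: String(error) });
  appendFileSync('scraper-errors.log', line + '\n');
}
```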
Finally, we consistently use the `defer` and `fromPromise` operators to convert Promises into Observables, ensuring a reactive and efficient data flow throughout the scraping process:
```typescript
defer(() => fromPromise(myPromise()));
```
This approach is a recommended best practice that works reliably in all scenarios. Promises are eager, whereas Observables are lazy and only start working when someone subscribes to them. The defer operator allows us to make a Promise lazy. See this link for more information.
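The difference is easy to see side by side (a sketch reusing the `myPromise` example above):

```ts
// Eager: myPromise() starts executing immediately, with or without subscribers
const eager$ = fromPromise(myPromise());

// Lazy: myPromise() is only invoked on subscription, once per subscriber
const lazy$ = defer(() => fromPromise(myPromise()));

lazy$.subscribe(value => console.log(value)); // the Promise is created here, not before
```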
3. Add an asynchronous loop to iterate through all pages
In the previous step, we learned how to obtain all job offers data from a LinkedIn page. Now, we want to use that code as many times as possible to gather as much data as we can. To achieve this, we first need to iterate through all available pages:
src/linkedin.ts
```ts
function getJobsFromAllPages(page: Page, initSearchParams: ScraperSearchParams): Observable<ScraperResult> {
  const getJobs$ = (searchParams: ScraperSearchParams) => goToLinkedinJobsPageAndExtractJobs(page, searchParams).pipe(
    map((jobs): ScraperResult => ({ jobs, searchParams } as ScraperResult)),
    catchError(error => {
      console.error(error);
      return of({ jobs: [], searchParams: searchParams });
    })
  );

  return getJobs$(initSearchParams).pipe(
    expand(({ jobs, searchParams }) => {
      console.log(`Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`);
      if (jobs.length === 0) {
        return EMPTY;
      } else {
        return getJobs$({ ...searchParams, pageNumber: searchParams.pageNumber + 1 });
      }
    })
  );
}
```
The code above increments the page number until we reach a page with no jobs. To perform this loop in RxJS, we use the `expand` operator, which recursively projects each source value to an Observable that is merged into the output Observable. Its functionality is well explained here.
In RxJS, we cannot use a `for` loop the way we would with async/await. Instead, we need another technique, such as the `expand` operator or a recursive loop. While this might initially appear to be a limitation, in an asynchronous context it proves more advantageous in many situations.
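Here is a minimal, self-contained example of `expand` (unrelated to scraping, just to show the recursion):

```ts
import { of, EMPTY } from 'rxjs';
import { expand } from 'rxjs/operators';

// Emits 0, 1, 2, 3: each emitted value is fed back into expand until we return EMPTY
of(0).pipe(
  expand(n => (n < 3 ? of(n + 1) : EMPTY))
).subscribe(console.log);
```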
So, what would the equivalent code using Promises look like? Here's an example:
```typescript
export async function getJobsFromAllPages(
  page: Page,
  searchParams: ScraperSearchParams
): Promise<ScraperResult> {
  const results: ScraperResult = { jobs: [], searchParams };

  try {
    while (true) {
      const jobs = await getJobsFromLinkedinPage(page, searchParams);
      console.log(
        `Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`
      );

      results.jobs.push(...jobs);

      if (jobs.length === 0) {
        break;
      }

      searchParams.pageNumber++;
    }
  } catch (error) {
    console.error('Error:', error);
    results.jobs = []; // Clear the jobs in case of an error.
  }

  return results;
}
```
This code is nearly equivalent to the Observable-based one, with one critical difference: it only emits when all pages have finished processing. In contrast, the implementation using Observables emits after each page. Creating a stream is crucial in this case because we want to handle the jobs as soon as they are resolved.
Of course, we could insert our own logic right after this line:
```typescript
const jobs = await getJobsFromLinkedinPage(page, searchParams);

/* Handle the jobs here */
```
...but this would unnecessarily couple our scraping code with the part that handles the jobs data. Handling the jobs data may involve some transformations, API calls, and finally, saving the data into a database.
In this example, we clearly see one of the many benefits Observables offer over Promises.
4. Implementing an Asynchronous Loop for Multiple Search Parameters
Now that we've established how to iterate through multiple pages for a given search query, it's time to take the next step: expanding our search across multiple search parameters.
To achieve this, we'll introduce an additional asynchronous loop that cycles through various search criteria, ensuring comprehensive data extraction.
The first step is defining a structured data format to store these search parameters. We'll call this list `searchParamsList`; it will hold different combinations of keywords, locations, or other relevant filters:
src/data.ts
```ts
const searchParamsList: { searchText: string; locationText: string }[] = [
  { searchText: 'Angular', locationText: 'Barcelona' },
  { searchText: 'Angular', locationText: 'Madrid' },
  // ...
  { searchText: 'React', locationText: 'Barcelona' },
  { searchText: 'React', locationText: 'Madrid' },
  // ...
];
```
To iterate through the `searchParamsList` array, we essentially need to convert it from an Array to an Observable using the `fromArray` operator. We then use the `concatMap` operator to sequentially process each searchText and locationText pair. The power of RxJS here is that, should we want to switch from sequential to parallel processing, we would just swap `concatMap` for a `mergeMap`. That is not recommended in this case, because we would exceed LinkedIn's rate limits, but it is worth considering in other scenarios.
src/linkedin.ts
```ts
/**
 * Creates a new page and scrapes LinkedIn job offers data for each pair of searchText and locationText,
 * recursively retrieving data until there are no more pages.
 * @param browser A Puppeteer instance
 * @returns An Observable that emits scraped job offers data as ScraperResult
 */
export function getJobsFromLinkedin(browser: Browser): Observable<ScraperResult> {
  // Create a new page
  const createPage = defer(() => fromPromise(browser.newPage()));

  // Iterate through search parameters and scrape jobs
  const scrapeJobs = (page: Page): Observable<ScraperResult> =>
    fromArray(searchParamsList).pipe(
      concatMap(({ searchText, locationText }) =>
        getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 })
      )
    );

  // Compose sequentially previous steps
  return createPage.pipe(switchMap(page => scrapeJobs(page)));
}
```
This code will loop through different search parameters, retrieving job listings for each combination of technology and location efficiently.
🎉 Congratulations! You now have the skills to scrape LinkedIn job postings! 🎉
However, like many other platforms, LinkedIn employs anti-scraping measures to prevent automated data extraction. Let’s explore how to handle these challenges 👇
Common Errors When Scraping LinkedIn
Running the code as it is will quickly lead to various errors, making it challenging to scrape a substantial amount of data. The two most common issues are:
1. 429 Status Code (Too Many Requests)
This error occurs when we send too many requests in a short period. To avoid being blocked, we need to slow down the request rate and introduce random delays until the error subsides.
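A simple way to introduce such delays (a sketch; tune the bounds to your needs, and call it inside async code before each navigation):

```ts
// Random pause between requests to stay under the rate limit
const randomDelayMs = (min = 2000, max = 8000) => Math.floor(min + Math.random() * (max - min));

// e.g. before each page.goto(...):
await new Promise(resolve => setTimeout(resolve, randomDelayMs()));
```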
2. LinkedIn Authwall
Occasionally, instead of the job listings page, LinkedIn may redirect us to an authentication wall. When this happens, the best approach is to pause requests for a while before trying again.
Handling 429 Errors & LinkedIn Authwall
To tackle these issues, we modify the `goToLinkedinJobsPageAndExtractJobs` function, keeping the HTML scraping logic in its own function, `getJobsFromLinkedinPage`, and adding error detection and retries around it. The updated code structure looks like this:
src/linkedin.ts
```ts
const AUTHWALL_PATH = 'linkedin.com/authwall';
const STATUS_TOO_MANY_REQUESTS = 429;
const JOB_SEARCH_SELECTOR = '.job-search-card';

function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
  return defer(() => fromPromise(page.setExtraHTTPHeaders({ 'accept-language': 'en-US,en;q=0.9' })))
    .pipe(
      switchMap(() => navigateToLinkedinJobsPage(page, searchParams)),
      tap(response => checkResponseStatus(response)),
      switchMap(() => throwErrorIfAuthwall(page)),
      switchMap(() => waitForJobSearchCard(page)),
      switchMap(() => getJobsFromLinkedinPage(page)),
      retryWhen(retryStrategyByCondition({
        maxRetryAttempts: 4,
        retryConditionFn: error => error.retry === true
      })),
      map(jobs => Array.isArray(jobs) ? jobs : []),
      take(1)
    );
}

/**
 * Navigate to the LinkedIn search page, using the provided search parameters.
 */
function navigateToLinkedinJobsPage(page: Page, searchParams: ScraperSearchParams) {
  return defer(() => fromPromise(page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' })));
}

/**
 * Check the HTTP response status and throw an error if too many requests have been made.
 */
function checkResponseStatus(response: any) {
  const status = response?.status();
  if (status === STATUS_TOO_MANY_REQUESTS) {
    throw { message: 'Status 429 (Too many requests)', retry: true, status: STATUS_TOO_MANY_REQUESTS };
  }
}

/**
 * Check if the current page is an authwall and throw an error if it is.
 */
function throwErrorIfAuthwall(page: Page) {
  return getPageLocationOperator(page).pipe(tap(locationHref => {
    if (locationHref.includes(AUTHWALL_PATH)) {
      console.error('Authwall error');
      throw { message: `Linkedin authwall! locationHref: ${locationHref}`, retry: true };
    }
  }));
}

/**
 * Wait for the job search card to be visible on the page, and handle timeouts or authwalls.
 */
function waitForJobSearchCard(page: Page) {
  return defer(() => fromPromise(page.waitForSelector(JOB_SEARCH_SELECTOR, { visible: true, timeout: 5000 }))).pipe(
    catchError(error => throwErrorIfAuthwall(page).pipe(tap(() => { throw error; })))
  );
}
```
In this code, we address the two errors described above: the 429 response and the authwall. Overcoming them is essential for scraping LinkedIn successfully.
To handle the errors, the code employs a custom retry strategy implemented by the `retryStrategyByCondition` function:
src/scraper.utils.ts
```ts
export const retryStrategyByCondition = ({ maxRetryAttempts = 3, scalingDuration = 1000, retryConditionFn = (error) => true }: {
  maxRetryAttempts?: number,
  scalingDuration?: number,
  retryConditionFn?: (error) => boolean
} = {}) => (attempts: Observable<any>) => {
  return attempts.pipe(
    mergeMap((error, i) => {
      const retryAttempt = i + 1;
      if (
        retryAttempt > maxRetryAttempts ||
        !retryConditionFn(error)
      ) {
        return throwError(error);
      }
      console.log(`Attempt ${retryAttempt}: retrying in ${retryAttempt * scalingDuration}ms`);
      // retry after 1s, 2s, etc...
      return timer(retryAttempt * scalingDuration);
    }),
    finalize(() => console.log('retryStrategyOnlySpecificErrors - finalized'))
  );
};
```
This strategy progressively increases the wait time between retries after each failure. This way, we ensure that we wait long enough for LinkedIn to allow us to make requests again.
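In our pipeline it is applied like this (a usage sketch; `source$` stands for any Observable in the scraping chain; with scalingDuration = 1000 the waits are 1s, 2s, 3s, then 4s before giving up):

```ts
source$.pipe(
  retryWhen(retryStrategyByCondition({
    maxRetryAttempts: 4,
    scalingDuration: 1000,
    retryConditionFn: error => error.retry === true // only retry errors flagged as retryable
  }))
);
```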
⚠️ Important Note: LinkedIn has strict anti-scraping measures, and excessive requests from a single IP address can lead to IP blacklisting. Simply increasing wait times between requests may not be a foolproof solution. To minimize the risk of detection and reduce errors, it's highly advisable to rotate IP addresses periodically. This can be achieved by using proxy services or VPNs, ensuring a more sustainable and uninterrupted scraping process.
Final Words
Web scraping can sometimes violate a website's terms of service, so it's crucial to review and respect the robots.txt file and Terms of Service before scraping any site. In this case, the provided code is intended strictly for educational and hobby purposes. LinkedIn specifically prohibits any data extraction from its website; you can read more here.
I encourage using web scraping as a learning tool, but always be mindful of ethical practices. Avoid excessive requests, respect the website's resources, and use the extracted data responsibly.
You can find the complete, updated code in this repository; don't hesitate to give it a star if it helped! 🙏⭐