Tutorial on how to scrape job offers from LinkedIn using Puppeteer and RxJS

Web scraping may seem like a simple task, but there are many challenges to overcome. In this blog, we will dive into how to scrape LinkedIn to extract job listings. To do this, we will use Puppeteer and RxJS. The goal is to achieve web scraping in a declarative, modular, and scalable manner.
What is Web Scraping?
Web scraping is an automated method of extracting valuable data from websites. It allows users to retrieve specific information—such as text, images, links, and structured content—without manually copying and pasting. This technique is widely used for various purposes, including market research, data analysis, job listings aggregation, and competitive intelligence.
By leveraging web scraping tools, developers can efficiently collect, process, and utilize web data, transforming unstructured online information into structured insights.
Puppeteer: A Powerful Web Scraping Tool
Puppeteer is a JavaScript library that provides programmatic control over Chrome or Chromium, in headless or full (headful) mode. It allows developers to automate tasks such as navigating web pages, interacting with elements, and extracting data, making it an excellent choice for web scraping.
One of Puppeteer's biggest advantages is its ability to handle dynamic content. Unlike traditional scraping techniques that rely solely on fetching raw HTML, Puppeteer can execute JavaScript, ensuring that all elements—including those loaded asynchronously—are properly rendered before extraction. This makes it particularly useful for scraping websites with complex structures or content hidden behind interactive elements.
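As a quick illustration, here is a minimal sketch (using a placeholder URL, not code from this project) of waiting for JavaScript-rendered content before extracting it:

```ts
import puppeteer from 'puppeteer';

// Minimal sketch: let the page's JavaScript finish rendering before extracting data
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' }); // wait for the network to go quiet
  await page.waitForSelector('h1'); // ensure the element actually rendered
  const heading = await page.$eval('h1', el => el.textContent);
  console.log(heading);
  await browser.close();
})();
```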
Understanding RxJS
RxJS is a powerful JavaScript library designed for reactive programming, making it easier to handle asynchronous data streams efficiently. In this project, we leverage RxJS due to its numerous advantages:
✅ Streamlined Asynchronous Workflow – Enables a declarative approach to managing async operations.
✅ Robust Error Handling – Provides built-in mechanisms to catch and handle errors gracefully.
✅ Effortless Retry Logic – Allows automatic retries when scraping issues arise.
✅ Flexible and Scalable Code – Simplifies adaptation as project complexity grows.
✅ Extensive Operator Support – Offers a rich set of functions to process and manipulate data efficiently.
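As a small taste of that declarative style, here is a sketch (fetchPage is a hypothetical stand-in for any async scraping step, not code from this project):

```ts
import { defer, from } from 'rxjs';
import { retry } from 'rxjs/operators';

// Hypothetical async step standing in for a real scraping call
const fetchPage = (url: string) => Promise.resolve(`<html>${url}</html>`);

// Lazy, retryable, and easy to extend by adding more operators to the pipe
const page$ = defer(() => from(fetchPage('https://example.com'))).pipe(
  retry(3) // automatically retry up to 3 times on error
);

page$.subscribe(html => console.log(html.length));
```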
1. Puppeteer initialization
The code snippet below initializes a Puppeteer browser instance in non-headless mode and subsequently creates a new web page. This represents the most fundamental and straightforward initialization process for Puppeteer:
src/index.ts
```ts
import puppeteer from 'puppeteer';

(async () => {
  console.log('Launching Chrome...');
  const browser = await puppeteer.launch({
    headless: false,
    // devtools: true,
    // slowMo: 250, // slow down puppeteer script so that it's easier to follow visually
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--single-process',
    ],
  });

  const page = await browser.newPage();

  /**
   * 1. Go to linkedin jobs url
   * 2. Get the jobs
   * 3. Repeat step 1 with other search parameters
   */
})();
```
2. Accessing LinkedIn Job Listings and Extracting Data
This is the core section of our blog, where we delve into the process of navigating LinkedIn’s job listings, parsing the HTML content, and extracting job details in a structured JSON format. Our approach ensures that we retrieve relevant job information efficiently while handling potential roadblocks during the scraping process.
2.1. Construct the URL for navigating to LinkedIn job offers page
To access LinkedIn's job listings, we need to construct a URL using the function `urlQueryPage`:
src/linkedin.ts
```ts
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;
```
In this case, I have already conducted the necessary research to identify a suitable URL for scraping. Our goal is to find a URL that can be dynamically parameterized based on our desired search criteria.
For this example, the key search parameters will include:
- `searchText` – The job title or keyword.
- `pageNumber` – The pagination index to navigate through job listings.
- `locationText` (optional) – A specific location filter to refine search results.
By structuring the URL accordingly, we can efficiently retrieve job listings that match our specified criteria.
Example URLs:

- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=0
- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&start=0
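These URLs are exactly what `urlQueryPage` produces; for example (a hypothetical invocation, not part of the original code):

```ts
// pageNumber 2 maps to start=50, since LinkedIn paginates in steps of 25
const url = urlQueryPage({ searchText: 'Angular', pageNumber: 2, locationText: 'Barcelona' });
// → https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=50&location=Barcelona
```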
2.2. Navigate to the URL and extract the job offers
With our target URL identified, we can proceed with the two primary actions required:
- Navigating to the job listings URL: directing our web scraping tool to the URL where the job listings are hosted.
- Extracting the job offers data and converting it to JSON: once on the job listings page, we'll use web scraping techniques to extract the job data and return it in JSON format.
src/linkedin.ts
```ts
export interface ScraperSearchParams {
  searchText: string;
  locationText: string;
  pageNumber: number;
}

/** main function */
export function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
  return defer(() => fromPromise(navigateToJobsPage(page, searchParams)))
    .pipe(switchMap(() => getJobsFromLinkedinPage(page)));
}

/* Utility functions */
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;

function navigateToJobsPage(page: Page, searchParams: ScraperSearchParams): Promise<Response | null> {
  return page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' });
}

export const stacks = ['angularjs', 'kubernetes', 'javascript', 'jenkins', 'html', /* ... */];

export function getJobsFromLinkedinPage(page: Page): Observable<JobInterface[]> {
  return defer(() => fromPromise(page.evaluate((pageEvalData) => {
    const collection: HTMLCollection = document.body.children;
    const results: JobInterface[] = [];
    for (let i = 0; i < collection.length; i++) {
      try {
        const item = collection.item(i)!;
        const title = item.getElementsByClassName('base-search-card__title')[0].textContent!.trim();
        const imgSrc = item.getElementsByTagName('img')[0].getAttribute('data-delayed-url') || '';
        const remoteOk: boolean = !!title.match(/remote|No office location/gi);

        const url = (
          (item.getElementsByClassName('base-card__full-link')[0] as HTMLLinkElement)
          || (item.getElementsByClassName('base-search-card--link')[0] as HTMLLinkElement)
        ).href;

        const companyNameAndLinkContainer = item.getElementsByClassName('base-search-card__subtitle')[0];
        const companyUrl: string | undefined = companyNameAndLinkContainer?.getElementsByTagName('a')[0]?.href;
        const companyName = companyNameAndLinkContainer.textContent!.trim();
        const companyLocation = item.getElementsByClassName('job-search-card__location')[0].textContent!.trim();

        const toDate = (dateString: string) => {
          const [year, month, day] = dateString.split('-');
          return new Date(parseFloat(year), parseFloat(month) - 1, parseFloat(day));
        };

        const dateTime = (
          item.getElementsByClassName('job-search-card__listdate')[0]
          || item.getElementsByClassName('job-search-card__listdate--new')[0] // less than a day. TODO: Improve precision on this case.
        ).getAttribute('datetime');
        const postedDate = toDate(dateTime as string).toISOString();

        /**
         * Calculate minimum and maximum salary
         *
         * Salary HTML example to parse:
         * <span class="job-result-card__salary-info">$65,000.00 - $90,000.00</span>
         */
        let currency: SalaryCurrency = '';
        let salaryMin = -1;
        let salaryMax = -1;

        const salaryCurrencyMap: any = {
          ['€']: 'EUR',
          ['$']: 'USD',
          ['£']: 'GBP',
        };

        const salaryInfoElem = item.getElementsByClassName('job-search-card__salary-info')[0];
        if (salaryInfoElem) {
          const salaryInfo: string = salaryInfoElem.textContent!.trim();
          if (salaryInfo.startsWith('€') || salaryInfo.startsWith('$') || salaryInfo.startsWith('£')) {
            const coinSymbol = salaryInfo.charAt(0);
            currency = salaryCurrencyMap[coinSymbol] || coinSymbol;
          }

          const matches = salaryInfo.match(/([0-9]|,|\.)+/g);
          if (matches && matches[0]) {
            // values are in US format, so we need to remove ALL the commas
            salaryMin = parseFloat(matches[0].replace(/,/g, ''));
          }
          if (matches && matches[1]) {
            // values are in US format, so we need to remove ALL the commas
            salaryMax = parseFloat(matches[1].replace(/,/g, ''));
          }
        }

        // Calculate tags
        let stackRequired: string[] = [];
        title.split(' ').concat(url.split('-')).forEach(word => {
          if (!!word) {
            const wordLowerCase = word.toLowerCase();
            if (pageEvalData.stacks.includes(wordLowerCase)) {
              stackRequired.push(wordLowerCase);
            }
          }
        });
        // Define uniq here: page.evaluate executes inside the browser, so we cannot easily import functions from other contexts
        const uniq = (_array) => _array.filter((item, pos) => _array.indexOf(item) == pos);
        stackRequired = uniq(stackRequired);

        const result: JobInterface = {
          id: item!.children[0].getAttribute('data-entity-urn') as string,
          city: companyLocation,
          url: url,
          companyUrl: companyUrl || '',
          img: imgSrc,
          date: new Date().toISOString(),
          postedDate: postedDate,
          title: title,
          company: companyName,
          location: companyLocation,
          salaryCurrency: currency,
          salaryMax: salaryMax,
          salaryMin: salaryMin,
          countryCode: '',
          countryText: '',
          descriptionHtml: '',
          remoteOk: remoteOk,
          stackRequired: stackRequired
        };
        console.log('result', result);

        results.push(result);
      } catch (e) {
        console.error(`Something went wrong retrieving linkedin page item: ${i} on url: ${window.location}`, e.stack);
      }
    }
    return results;
  }, { stacks })) as Observable<JobInterface[]>);
}
```
The code above extracts the information for every job offer on the page. It may not be the most elegant code, but it gets the job done: parsing this kind of HTML inevitably requires many fallbacks and checks.
In a standard programming context, breaking code into smaller, isolated functions improves readability and maintainability. However, when working with `page.evaluate` in Puppeteer, we face certain limitations. Since this code executes within the Puppeteer (Chrome) instance rather than our Node.js environment, all logic must be self-contained within the `page.evaluate` call. The only exception is simple variables (such as `stacks` in our case), which can be passed as arguments to `page.evaluate`. However, these variables must not contain functions or other non-serializable objects, as Puppeteer does not support passing non-serializable data between Node.js and the browser context.
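To make the constraint concrete, here is a small sketch (assuming an existing `page` inside an async function; not code from the repo):

```ts
// Only JSON-serializable data can cross from Node.js into the browser context
const pageEvalData = { stacks: ['react', 'angular'] }; // a plain object is fine to pass

const found = await page.evaluate((data) => {
  // This function body runs inside Chrome: helpers must be defined here, not imported from Node.js
  const text = document.body.innerText.toLowerCase();
  return data.stacks.filter(stack => text.includes(stack)); // data arrived as a serialized copy
}, pageEvalData);
```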
In this case, the most challenging part of scraping is extracting the salary information, as it requires converting a text format like "$65,000.00 - $90,000.00" into separate `salaryMin` and `salaryMax` values.
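Isolated from the DOM-traversal code, the parsing idea looks roughly like this (a minimal sketch, not the exact code from the repo):

```ts
// Sketch: turn "$65,000.00 - $90,000.00" into numeric bounds
const parseSalaryRange = (salaryInfo: string): { salaryMin: number; salaryMax: number } => {
  const matches = salaryInfo.match(/([0-9]|,|\.)+/g) || []; // → ["65,000.00", "90,000.00"]
  const toNumber = (s?: string) => (s ? parseFloat(s.replace(/,/g, '')) : -1); // US format: strip commas
  return { salaryMin: toNumber(matches[0]), salaryMax: toNumber(matches[1]) };
};

console.log(parseSalaryRange('$65,000.00 - $90,000.00')); // { salaryMin: 65000, salaryMax: 90000 }
```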
To handle potential issues gracefully, we have encapsulated the entire code within a try/catch block. While we currently log errors to the console, it is highly recommended to implement a mechanism for storing error logs on disk. This is especially important because websites frequently update their structure, requiring regular adjustments to the HTML parsing logic.
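A disk-backed error log can be as simple as this sketch (`logScraperError` is a hypothetical helper, not part of the repo):

```ts
import { appendFileSync } from 'fs';

// Persist scraping errors so selector breakages can be reviewed and fixed later
function logScraperError(context: string, error: unknown) {
  const line = JSON.stringify({ date: new Date().toISOString(), context, error: String(error) });
  appendFileSync('scraper-errors.log', line + '\n');
}
```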
Finally, we consistently use the `defer` and `fromPromise` operators to convert Promises into Observables, ensuring a reactive and efficient data flow throughout the scraping process:
```typescript
defer(() => fromPromise(myPromise()));
```
This approach is a recommended best practice that works reliably in all scenarios. Promises are eager, whereas Observables are lazy and only start working when someone subscribes to them. The defer operator allows us to make a Promise lazy. See this link for more information.
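The difference is easy to see side by side (a sketch reusing the `myPromise` example above):

```ts
// Eager: myPromise() starts executing immediately, with or without subscribers
const eager$ = fromPromise(myPromise());

// Lazy: myPromise() is only invoked on subscription, once per subscriber
const lazy$ = defer(() => fromPromise(myPromise()));

lazy$.subscribe(value => console.log(value)); // the Promise is created here, not before
```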
3. Add an asynchronous loop to iterate through all pages
In the previous step, we learned how to obtain all job offers data from a LinkedIn page. Now, we want to use that code as many times as possible to gather as much data as we can. To achieve this, we first need to iterate through all available pages:
src/linkedin.ts
```ts
function getJobsFromAllPages(page: Page, initSearchParams: ScraperSearchParams): Observable<ScraperResult> {
  const getJobs$ = (searchParams: ScraperSearchParams) => goToLinkedinJobsPageAndExtractJobs(page, searchParams).pipe(
    map((jobs): ScraperResult => ({ jobs, searchParams } as ScraperResult)),
    catchError(error => {
      console.error(error);
      return of({ jobs: [], searchParams: searchParams });
    })
  );

  return getJobs$(initSearchParams).pipe(
    expand(({ jobs, searchParams }) => {
      console.log(`Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`);
      if (jobs.length === 0) {
        return EMPTY;
      } else {
        return getJobs$({ ...searchParams, pageNumber: searchParams.pageNumber + 1 });
      }
    })
  );
}
```
The code above increments the page number until we reach a page with no jobs. To perform this loop in RxJS, we use the `expand` operator, which recursively projects each source value to an Observable that is merged into the output Observable. Its functionality is well explained here.
In RxJS, we cannot use a `for` loop the way we would with async/await. Instead, we need another technique, such as the `expand` operator or a recursive loop. While this might initially appear to be a limitation, in an asynchronous context it proves more advantageous in many situations.
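Here is a minimal, self-contained example of `expand` (unrelated to scraping, just to show the recursion):

```ts
import { of, EMPTY } from 'rxjs';
import { expand } from 'rxjs/operators';

// Emits 0, 1, 2, 3: each emitted value is fed back into expand until we return EMPTY
of(0).pipe(
  expand(n => (n < 3 ? of(n + 1) : EMPTY))
).subscribe(console.log);
```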
So, what would the equivalent code using Promises look like? Here's an example:
```typescript
export async function getJobsFromAllPages(
  page: Page,
  searchParams: ScraperSearchParams
): Promise<ScraperResult> {
  const results: ScraperResult = { jobs: [], searchParams };

  try {
    while (true) {
      const jobs = await getJobsFromLinkedinPage(page, searchParams);
      console.log(
        `Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`
      );

      results.jobs.push(...jobs);

      if (jobs.length === 0) {
        break;
      }

      searchParams.pageNumber++;
    }
  } catch (error) {
    console.error('Error:', error);
    results.jobs = []; // Clear the jobs in case of an error.
  }

  return results;
}
```
This code is nearly equivalent to the Observable-based one, with one critical difference: it only emits when all pages have finished processing. In contrast, the implementation using Observables emits after each page. Creating a stream is crucial in this case because we want to handle the jobs as soon as they are resolved.
Of course, we could insert our own logic right after this line:
```typescript
const jobs = await getJobsFromLinkedinPage(page, searchParams);

/* Handle the jobs here */
```
...but this would unnecessarily couple our scraping code with the part that handles the jobs data. Handling the jobs data may involve some transformations, API calls, and finally, saving the data into a database.
In this example, we clearly see one of the many benefits Observables offer over Promises.
4. Implementing an Asynchronous Loop for Multiple Search Parameters
Now that we've established how to iterate through multiple pages for a given search query, it's time to take the next step: expanding our search across multiple search parameters.
To achieve this, we'll introduce an additional asynchronous loop that cycles through various search criteria, ensuring comprehensive data extraction.
The first step is defining a structured data format to store these search parameters. We'll call this list `searchParamsList`; it will hold different combinations of keywords, locations, or other relevant filters:
src/data.ts
```ts
const searchParamsList: { searchText: string; locationText: string }[] = [
  { searchText: 'Angular', locationText: 'Barcelona' },
  { searchText: 'Angular', locationText: 'Madrid' },
  // ...
  { searchText: 'React', locationText: 'Barcelona' },
  { searchText: 'React', locationText: 'Madrid' },
  // ...
];
```
To iterate through the `searchParamsList` array, we essentially need to convert it from an Array to an Observable using the `fromArray` operator. We then use the `concatMap` operator to sequentially process each searchText and locationText pair. The power of RxJS here is that, should we want to switch from sequential to parallel processing, we would just swap `concatMap` for a `mergeMap`. That is not recommended in this case, because we would exceed LinkedIn's rate limits, but it is worth considering in other scenarios.
src/linkedin.ts
```ts
/**
 * Creates a new page and scrapes LinkedIn job offers data for each pair of searchText and locationText,
 * recursively retrieving data until there are no more pages.
 * @param browser A Puppeteer instance
 * @returns An Observable that emits scraped job offers data as ScraperResult
 */
export function getJobsFromLinkedin(browser: Browser): Observable<ScraperResult> {
  // Create a new page
  const createPage = defer(() => fromPromise(browser.newPage()));

  // Iterate through search parameters and scrape jobs
  const scrapeJobs = (page: Page): Observable<ScraperResult> =>
    fromArray(searchParamsList).pipe(
      concatMap(({ searchText, locationText }) =>
        getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 })
      )
    );

  // Compose sequentially previous steps
  return createPage.pipe(switchMap(page => scrapeJobs(page)));
}
```
This code will loop through different search parameters, retrieving job listings for each combination of technology and location efficiently.
🎉 Congratulations! You now have the skills to scrape LinkedIn job postings! 🎉
However, like many other platforms, LinkedIn employs anti-scraping measures to prevent automated data extraction. Let’s explore how to handle these challenges 👇
Common Errors When Scraping LinkedIn
Running the code as it is will quickly lead to various errors, making it challenging to scrape a substantial amount of data. The two most common issues are:
1. 429 Status Code (Too Many Requests)
This error occurs when we send too many requests in a short period. To avoid being blocked, we need to slow down the request rate and introduce random delays until the error subsides.
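A simple way to introduce such delays (a sketch; tune the bounds to your needs, and call it inside async code before each navigation):

```ts
// Random pause between requests to stay under the rate limit
const randomDelayMs = (min = 2000, max = 8000) => Math.floor(min + Math.random() * (max - min));

// e.g. before each page.goto(...):
await new Promise(resolve => setTimeout(resolve, randomDelayMs()));
```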
2. LinkedIn Authwall
Occasionally, instead of the job listings page, LinkedIn may redirect us to an authentication wall. When this happens, the best approach is to pause requests for a while before trying again.
Handling 429 Errors & LinkedIn Authwall
To tackle these issues, we modify the `goToLinkedinJobsPageAndExtractJobs` function, keeping the HTML scraping logic in its own function, `getJobsFromLinkedinPage`, and adding error detection and retries around it. The updated code structure looks like this:
src/linkedin.ts
```ts
const AUTHWALL_PATH = 'linkedin.com/authwall';
const STATUS_TOO_MANY_REQUESTS = 429;
const JOB_SEARCH_SELECTOR = '.job-search-card';

function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
  return defer(() => fromPromise(page.setExtraHTTPHeaders({ 'accept-language': 'en-US,en;q=0.9' })))
    .pipe(
      switchMap(() => navigateToLinkedinJobsPage(page, searchParams)),
      tap(response => checkResponseStatus(response)),
      switchMap(() => throwErrorIfAuthwall(page)),
      switchMap(() => waitForJobSearchCard(page)),
      switchMap(() => getJobsFromLinkedinPage(page)),
      retryWhen(retryStrategyByCondition({
        maxRetryAttempts: 4,
        retryConditionFn: error => error.retry === true
      })),
      map(jobs => Array.isArray(jobs) ? jobs : []),
      take(1)
    );
}

/**
 * Navigate to the LinkedIn search page, using the provided search parameters.
 */
function navigateToLinkedinJobsPage(page: Page, searchParams: ScraperSearchParams) {
  return defer(() => fromPromise(page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' })));
}

/**
 * Check the HTTP response status and throw an error if too many requests have been made.
 */
function checkResponseStatus(response: any) {
  const status = response?.status();
  if (status === STATUS_TOO_MANY_REQUESTS) {
    throw { message: 'Status 429 (Too many requests)', retry: true, status: STATUS_TOO_MANY_REQUESTS };
  }
}

/**
 * Check if the current page is an authwall and throw an error if it is.
 */
function throwErrorIfAuthwall(page: Page) {
  return getPageLocationOperator(page).pipe(tap(locationHref => {
    if (locationHref.includes(AUTHWALL_PATH)) {
      console.error('Authwall error');
      throw { message: `Linkedin authwall! locationHref: ${locationHref}`, retry: true };
    }
  }));
}

/**
 * Wait for the job search card to be visible on the page, and handle timeouts or authwalls.
 */
function waitForJobSearchCard(page: Page) {
  return defer(() => fromPromise(page.waitForSelector(JOB_SEARCH_SELECTOR, { visible: true, timeout: 5000 }))).pipe(
    catchError(error => throwErrorIfAuthwall(page).pipe(tap(() => { throw error; })))
  );
}
```
In this code, we address the two errors described above: the 429 response and the authwall. Overcoming them is essential for scraping LinkedIn successfully.
To handle the errors, the code employs a custom retry strategy implemented by the `retryStrategyByCondition` function:
src/scraper.utils.ts
```ts
export const retryStrategyByCondition = ({ maxRetryAttempts = 3, scalingDuration = 1000, retryConditionFn = (error) => true }: {
  maxRetryAttempts?: number,
  scalingDuration?: number,
  retryConditionFn?: (error) => boolean
} = {}) => (attempts: Observable<any>) => {
  return attempts.pipe(
    mergeMap((error, i) => {
      const retryAttempt = i + 1;
      if (
        retryAttempt > maxRetryAttempts ||
        !retryConditionFn(error)
      ) {
        return throwError(error);
      }
      console.log(`Attempt ${retryAttempt}: retrying in ${retryAttempt * scalingDuration}ms`);
      // retry after 1s, 2s, etc...
      return timer(retryAttempt * scalingDuration);
    }),
    finalize(() => console.log('retryStrategyOnlySpecificErrors - finalized'))
  );
};
```
This strategy progressively increases the wait time between retries after each failure. This way, we ensure that we wait long enough for LinkedIn to allow us to make requests again.
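In our pipeline it is applied like this (a usage sketch; `source$` stands for any Observable in the scraping chain; with scalingDuration = 1000 the waits are 1s, 2s, 3s, then 4s before giving up):

```ts
source$.pipe(
  retryWhen(retryStrategyByCondition({
    maxRetryAttempts: 4,
    scalingDuration: 1000,
    retryConditionFn: error => error.retry === true // only retry errors flagged as retryable
  }))
);
```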
⚠️ Important Note: LinkedIn has strict anti-scraping measures, and excessive requests from a single IP address can lead to IP blacklisting. Simply increasing wait times between requests may not be a foolproof solution. To minimize the risk of detection and reduce errors, it's highly advisable to rotate IP addresses periodically. This can be achieved by using proxy services or VPNs, ensuring a more sustainable and uninterrupted scraping process.
Final Words
Web scraping can sometimes violate a website's terms of service, so it's crucial to review and respect the robots.txt file and Terms of Service before scraping any site. In this case, the provided code is intended strictly for educational and hobby purposes. LinkedIn specifically prohibits any data extraction from its website; you can read more here.
I encourage using web scraping as a learning tool, but always be mindful of ethical practices. Avoid excessive requests, respect the website's resources, and use the extracted data responsibly.
You can find the complete, updated code in this repository; don't hesitate to give it a star if it helped! 🙏⭐